SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation
Zenith, Ayush, Zumbrun, Arnold, Raut, Neel, Lin, Jing
The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Data Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data.
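A metric like SDQM is validated by how strongly it tracks downstream detector performance. The sketch below illustrates that validation step with Pearson correlation; the quality and mAP values are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical quality scores for five synthetic datasets, paired with
# the mAP each dataset produced after training a detector on it.
# All numbers are invented for illustration.
sdqm_scores = np.array([0.42, 0.55, 0.61, 0.73, 0.88])
map_scores  = np.array([0.31, 0.44, 0.47, 0.62, 0.79])

# A dataset-quality metric is useful when it correlates strongly with
# mAP, letting you rank candidate datasets without training to
# convergence on each one. Pearson's r is one common choice.
r = np.corrcoef(sdqm_scores, map_scores)[0, 1]
```

With scores this well aligned, r comes out close to 1; a weak metric would show the moderate or low correlations the abstract attributes to prior work.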
Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs
Gupta, Ayush, Kaur, Ramneet, Roy, Anirban, Cobb, Adam D., Chellappa, Rama, Jha, Susmit
We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model's dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of $2\%$ to $37\%$ when treating OOD datapoints as positives and in-domain test datapoints as negatives.
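The ICAD machinery the abstract relies on can be sketched in a few lines. This is a generic inductive conformal p-value computation, not the authors' implementation; the dropout-tolerance values are invented, and the non-conformity score is simply the negated tolerance, following the paper's hypothesis that in-domain inputs tolerate more dropout.

```python
import numpy as np

# Non-conformity = negative dropout tolerance: in-domain inputs are
# hypothesized to tolerate more dropout, so they get lower scores.
# Calibration scores come from held-out in-domain data (values invented).
calib_tolerance = np.array([0.81, 0.74, 0.90, 0.68, 0.77, 0.85, 0.72, 0.79])
calib_scores = -calib_tolerance

def icad_p_value(test_tolerance):
    """Inductive conformal p-value: fraction of calibration scores at
    least as extreme as the test score, with the +1 smoothing that
    gives ICAD its false-alarm guarantee."""
    test_score = -test_tolerance
    n = len(calib_scores)
    return (np.sum(calib_scores >= test_score) + 1) / (n + 1)

# A low-dropout-tolerance input (OOD-like) gets a small p-value; flag
# it as OOD when p <= epsilon, the desired false-alarm rate.
p_in  = icad_p_value(0.80)   # tolerance typical of in-domain data
p_ood = icad_p_value(0.30)   # much lower tolerance
```

The validity guarantee says that for in-domain inputs the p-value is (super)uniform, so thresholding at epsilon bounds the false alarm rate by epsilon; the paper's contribution is aggregating such p-values across layers while preserving that bound.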
A Data-Based Architecture for Flight Test without Test Points
Harp, D. Isaiah, Ott, Joshua, Alora, John, Asmar, Dylan
The justification for the "test point" derives from the test pilot's obligation to reproduce faithfully the pre-specified conditions of some model prediction. Pilot deviation from those conditions invalidates the model assumptions. Flight test aids have been proposed to increase accuracy on more challenging test points. However, the very existence of databands and tolerances is a problem more fundamental than inadequate pilot skill. We propose a novel approach that eliminates test points. We start with a high-fidelity digital model of an air vehicle. Instead of using this model to generate a point prediction, we use a machine learning method to produce a reduced-order model (ROM). The ROM has two important properties. First, it can generate a prediction based on any set of conditions the pilot flies. Second, if the test result at those conditions differs from the prediction, the ROM can be updated using the new data. The outcome of flight test is thus a refined ROM at whatever conditions were flown. This ROM in turn updates and validates the high-fidelity model. We present a single example of this "point-less" architecture, using T-38C flight test data. We first use a generic aircraft model to build a ROM of longitudinal pitching motion as a hypersurface. We then ingest unconstrained flight test data and use Gaussian Process Regression to update and condition the hypersurface. By proposing a second-order equivalent system for the T-38C, this hypersurface then generates the parameters necessary to assess MIL-STD-1797B compliance for longitudinal dynamics.
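The "update and condition the hypersurface" step is standard Gaussian process conditioning. Below is a minimal one-dimensional sketch of that mechanism: a stand-in model response plays the role of the ROM, and one extra point plays the role of an unconstrained flight-test datum. The kernel, length scale, and response function are all illustrative choices, not the paper's T-38C model.

```python
import numpy as np

def rbf_kernel(a, b, length=0.2, var=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Condition a zero-mean GP prior on observed data; return the
    posterior mean and variance at the query points."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_query, x_train)
    Kss = rbf_kernel(x_query, x_query)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Stand-in "ROM": a pitching-response-like curve over one flight
# condition (a 1-D slice; the paper's hypersurface is multidimensional).
x_model = np.linspace(0.0, 1.0, 8)      # conditions from the generic model
y_model = np.sin(2 * np.pi * x_model)   # model-predicted response (illustrative)

# Flight test produces a datum at whatever condition was actually flown;
# append it and re-condition. The +0.05 offset mimics model error.
x_all = np.append(x_model, 0.37)
y_all = np.append(y_model, np.sin(2 * np.pi * 0.37) + 0.05)

mean, var = gp_posterior(x_all, y_all, np.array([0.37]))
```

After conditioning, the posterior mean at the flown condition tracks the measured value rather than the original model prediction, and the posterior variance there collapses, which is exactly the "refined ROM" behavior the abstract describes.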
Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding
Padhi, Trilok, Kaur, Ramneet, Cobb, Adam D., Acharya, Manoj, Roy, Anirban, Samplawski, Colin, Matejek, Brian, Berenbeim, Alexander M., Bastian, Nathaniel D., Jha, Susmit
We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs). Existing state-of-the-art UQ methods rely on consistency among multiple responses generated by the LLM on an input query under diverse settings. However, these approaches often report higher confidence in scenarios where the LLM is consistently incorrect. This leads to a poorly calibrated confidence with respect to accuracy. To address this, we leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models. Specifically, we ground the textual responses to the visual inputs. The confidence from the grounding model is used to calibrate the overall confidence. Given that using a grounding model adds its own uncertainty in the pipeline, we apply temperature scaling - a widely accepted parametric calibration technique - to calibrate the grounding model's confidence in the accuracy of generated responses. We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA. The experiments demonstrate that the proposed framework achieves significantly improved calibration on both tasks.
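Temperature scaling, the parametric calibration step the abstract applies to the grounding model, fits a single scalar T on held-out data by minimizing negative log-likelihood. Here is a minimal sketch with an invented held-out set built to be overconfident (roughly 98% confidence at 70% accuracy), the regime temperature scaling is meant to fix; the grid-search fit is one simple way to solve this one-dimensional problem.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Average negative log-likelihood at temperature T."""
    p = softmax(logits, T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

# Invented held-out set: the model always predicts class 0 with high
# confidence, but is correct on only 7 of 10 examples.
logits = np.tile([4.0, 0.0], (10, 1))
labels = np.array([0] * 7 + [1] * 3)

# Fit T by grid search on held-out NLL.
grid = np.linspace(0.5, 10.0, 500)
T_star = grid[np.argmin([nll(logits, labels, T) for T in grid])]

# After scaling, the reported confidence should match the 70% accuracy.
conf_calibrated = softmax(logits, T_star)[0, 0]
```

The fitted temperature comes out well above 1, softening the probabilities so confidence matches accuracy; T < 1 would instead sharpen an underconfident model.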
Spill The Beans: Exploiting CPU Cache Side-Channels to Leak Tokens from Large Language Models
Side-channel attacks on shared hardware resources increasingly threaten confidentiality, especially with the rise of Large Language Models (LLMs). In this work, we introduce Spill The Beans, a novel application of cache side-channels to leak tokens generated by an LLM. By co-locating an attack process on the same hardware as the victim model, we flush and reload embedding vectors from the embedding layer, where each token corresponds to a unique embedding vector. When a token's embedding vector is accessed during generation, it produces a cache hit detectable by our attack on shared lower-level caches. A significant challenge is the massive size of LLMs, whose compute-intensive operation quickly evicts embedding vectors from the cache. We address this by balancing the number of tokens monitored against the amount of information leaked. Monitoring more tokens increases potential vocabulary leakage but raises the chance of missing cache hits due to eviction; monitoring fewer tokens improves detection reliability but limits vocabulary coverage. Through extensive experimentation, we demonstrate the feasibility of leaking tokens from LLMs via cache side-channels. Our findings reveal a new vulnerability in LLM deployments, highlighting that even sophisticated models are susceptible to traditional side-channel attacks. We discuss the implications for privacy and security in LLM-serving infrastructures and suggest considerations for mitigating such threats. For proof of concept, we consider two concrete attack scenarios: recovering a high-entropy API key and recovering natural English text. Our experiments show that an attacker can recover as much as 80%-90% of a high-entropy API key with single-shot monitoring, and up to 40% of English text with a single shot. We note that the recovery rate depends heavily on the monitored token set, and these rates can be improved by targeting more specialized output domains.
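The core decision in a flush-and-reload attack is classifying each reload latency as a cache hit (the victim touched the line) or a miss (the line was still flushed). The sketch below simulates that threshold decision in Python; the latency distributions and the 150-cycle threshold are illustrative stand-ins for measurements that in practice vary by CPU and must be calibrated per machine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated reload latencies in cycles. On real hardware, hits come
# from the cache hierarchy and misses from DRAM; these Gaussians are
# invented for illustration only.
hit_latencies  = rng.normal(80, 10, 1000)    # victim accessed the embedding
miss_latencies = rng.normal(250, 30, 1000)   # line was still flushed

THRESHOLD = 150  # cycles; calibrated per machine in a real attack

def is_hit(latency):
    """Classify one reload as a cache hit."""
    return latency < THRESHOLD

hit_rate  = np.mean([is_hit(t) for t in hit_latencies])
miss_rate = np.mean([not is_hit(t) for t in miss_latencies])
```

When the two distributions are this well separated, the threshold classifies almost every probe correctly; the eviction problem the abstract describes shows up as hits that are simply never observed, which thresholding cannot recover, hence the tradeoff in how many tokens to monitor.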
Token embeddings violate the manifold hypothesis
Robinson, Michael, Dey, Sourya, Chiang, Tony
Fully understanding the behavior of a large language model (LLM) requires understanding its input space. If this input space differs from our assumption, our understanding of and conclusions about the LLM are likely flawed, regardless of its architecture. Here, we elucidate the structure of the token embeddings, the input domain for LLMs, both empirically and theoretically. We present a generalized and statistically testable model where the neighborhood of each token splits into well-defined signal and noise dimensions. This model is based on a generalization of a manifold called a fiber bundle, so we denote our hypothesis test as the ``fiber bundle null.'' Failing to reject the null is uninformative, but rejecting it at a specific token indicates that token has a statistically significant local structure, and so is of interest to us. By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the token subspace is provably not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, and if one prompt contains a token implicated by our test, that prompt will likely exhibit more output variability proportional to the local signal dimension of the token.
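The "signal and noise dimensions" picture can be illustrated with local PCA on synthetic data: points concentrated near a low-dimensional plane inside a high-dimensional ambient space, plus small isotropic noise. This is not the paper's fiber-bundle test, just a sketch of how a token neighborhood's covariance spectrum separates into a few large (signal) and many small (noise) directions; all dimensions and scales below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for token embeddings: 400 points near a 2-D
# "signal" plane inside a 10-D ambient space, plus small "noise".
n, ambient, signal_dim, noise_scale = 400, 10, 2, 0.01
basis = np.linalg.qr(rng.normal(size=(ambient, signal_dim)))[0]
points = rng.normal(size=(n, signal_dim)) @ basis.T
points += noise_scale * rng.normal(size=(n, ambient))

# Local PCA around one point: eigenvalues of the neighborhood
# covariance split into signal and noise groups.
center = points[0]
k = 50
nbrs = points[np.argsort(np.linalg.norm(points - center, axis=1))[:k]]
cov = np.cov(nbrs - nbrs.mean(axis=0), rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Count directions holding 95% of the local variance.
ratio = np.cumsum(eigvals) / eigvals.sum()
est_signal_dims = int(np.searchsorted(ratio, 0.95) + 1)
```

On this synthetic neighborhood the count recovers the planted signal dimension of 2; the paper's contribution is making this kind of local-structure statement statistically testable per token.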
Probing the topology of the space of tokens with structured prompts
Robinson, Michael, Dey, Sourya, Kushner, Taisa
The set of tokens T, when embedded within the latent space X of a large language model (LLM), can be thought of as a finite sample drawn from a distribution supported on a topological subspace of X. One can ask what the smallest (in the sense of inclusion) subspace and simplest (in terms of fewest free parameters) distribution can account for such a sample. Previous work [1] suggests that the smallest topological subspace from which tokens can be drawn is not a manifold, but has structure consistent with a stratified manifold. That paper relied upon knowing the token input embedding function T → X, which, given each token t ∈ T, ascribes a representation in X. Because embeddings preserve topological structure, in this paper, we will study T by equating it with the image of the token input embedding function, thereby treating T both as the set of tokens and as a subspace of X. This subspace is called the token subspace of X. Usually X is taken to be Euclidean space R
Open-Source High-Speed Flight Surrogate Modeling Framework
Korenyi-Both, Tyler E., Falkiewicz, Nathan J., Jones, Matthew C.
High-speed flight vehicles, which travel much faster than the speed of sound, are crucial for national defense and space exploration. However, accurately predicting their behavior under numerous, varied flight conditions is a challenge and often prohibitively expensive. The proposed approach involves creating smarter, more efficient machine learning models (also known as surrogate models or meta models) that can fuse data generated from a variety of fidelity levels -- to include engineering methods, simulation, wind tunnel, and flight test data -- to make more accurate predictions. These models are able to move the bulk of the computation from high performance computing (HPC) to single user machines (laptop, desktop, etc.). The project builds upon previous work but introduces code improvements and an informed perspective on the direction of the field. The new surrogate modeling framework is now modular and, by design, broadly applicable to many modeling problems. The new framework also has a more robust automatic hyperparameter tuning capability and abstracts away most of the pre- and post-processing tasks. The Gaussian process regression and deep neural network-based models included in the presented framework were able to model two datasets with high accuracy (R^2>0.99). The primary conclusion is that the framework is effective and has been delivered to the Air Force for integration into real-world projects. For future work, significant and immediate investment in continued research is crucial. The author recommends further testing and refining modeling methods that explicitly incorporate physical laws and are robust enough to handle simulation and test data from varying resolutions and sources, including coarse meshes, fine meshes, unstructured meshes, and limited experimental test points.
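The surrogate idea the abstract describes, fit a cheap model to expensive simulation output and evaluate it on a laptop, can be sketched with a radial-basis-function surrogate. The "simulation" here is a stand-in analytic function, and the kernel, length scale, and ridge term are illustrative choices, not the delivered framework's models.

```python
import numpy as np

def fit_rbf(x_train, y_train, length=0.1, ridge=1e-8):
    """Solve for Gaussian-RBF weights, with a small ridge for stability."""
    d = x_train[:, None] - x_train[None, :]
    Phi = np.exp(-0.5 * (d / length) ** 2)
    return np.linalg.solve(Phi + ridge * np.eye(len(x_train)), y_train)

def predict_rbf(x_train, weights, x_query, length=0.1):
    d = x_query[:, None] - x_train[None, :]
    return np.exp(-0.5 * (d / length) ** 2) @ weights

# "Expensive" model output, evaluated offline at 25 design points.
x_train = np.linspace(0.0, 1.0, 25)
y_train = np.sin(3.0 * np.pi * x_train)

w = fit_rbf(x_train, y_train)

# Cheap surrogate evaluations at held-out conditions.
x_test = np.linspace(0.02, 0.98, 50)
y_true = np.sin(3.0 * np.pi * x_test)
y_pred = predict_rbf(x_train, w, x_test)

# Coefficient of determination, the accuracy measure quoted above.
r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```

For a smooth response sampled this densely, the held-out R² lands very close to 1, which is the regime the abstract reports; the framework's value is doing this robustly with multi-fidelity data, automatic hyperparameter tuning, and pre/post-processing handled for the user.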
Quantum Boltzmann machine learning of ground-state energies
Patel, Dhrumil, Koch, Daniel, Patel, Saahil, Wilde, Mark M.
Estimating the ground-state energy of Hamiltonians is a fundamental task for which it is believed that quantum computers can be helpful. Several approaches have been proposed toward this goal, including algorithms based on quantum phase estimation and hybrid quantum-classical optimizers involving parameterized quantum circuits, the latter falling under the umbrella of the variational quantum eigensolver. Here, we analyze the performance of quantum Boltzmann machines for this task, which is a less explored ansatz based on parameterized thermal states and which is not known to suffer from the barren-plateau problem. We delineate a hybrid quantum-classical algorithm for this task and rigorously prove that it converges to an $\varepsilon$-approximate stationary point of the energy function optimized over parameter space, while using a number of parameterized-thermal-state samples that is polynomial in $\varepsilon^{-1}$, the number of parameters, and the norm of the Hamiltonian being optimized. Our algorithm estimates the gradient of the energy function efficiently by means of a novel quantum circuit construction that combines classical sampling, Hamiltonian simulation, and the Hadamard test, thus overcoming a key obstacle to quantum Boltzmann machine learning that has been left open since [Amin et al., Phys. Rev. X 8, 021050 (2018)]. Additionally supporting our main claims are calculations of the gradient and Hessian of the energy function, as well as an upper bound on the matrix elements of the latter that is used in the convergence analysis.
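The objective the hybrid loop optimizes, the energy of a parameterized thermal state, can be written down classically for a tiny system. The sketch below uses a two-qubit example with an invented two-parameter ansatz, computes the thermal state exactly by eigendecomposition, and substitutes a finite-difference gradient for the paper's quantum estimator (Hamiltonian simulation plus the Hadamard test); none of this is the authors' algorithm, only the classical shape of the problem.

```python
import numpy as np

# Pauli matrices for a tiny two-qubit example.
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def kron(*ops):
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

# Target Hamiltonian whose ground-state energy we want to approach.
H = kron(Z, Z) + 0.5 * kron(X, I2) + 0.5 * kron(I2, X)

# Toy parameterized thermal ansatz: rho(theta) = exp(-G(theta)) / Tr[.],
# with G a weighted sum of Hermitian generators (invented for this sketch).
generators = [kron(Z, Z), kron(X, I2) + kron(I2, X)]

def thermal_state(theta):
    G = sum(t * g for t, g in zip(theta, generators))
    w, V = np.linalg.eigh(G)               # G is Hermitian
    expG = (V * np.exp(-w)) @ V.T          # matrix exponential via spectrum
    return expG / np.trace(expG)

def energy(theta):
    """Objective minimized by the hybrid loop: Tr[H rho(theta)]."""
    return float(np.trace(H @ thermal_state(theta)).real)

# Classical finite-difference stand-in for the quantum gradient estimator.
def grad(theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (energy(theta + e) - energy(theta - e)) / (2 * eps)
    return g

theta = np.array([0.1, -0.2])
E0 = energy(theta)
for _ in range(100):                       # plain gradient descent
    theta = theta - 0.1 * grad(theta)

E = energy(theta)
ground = float(np.linalg.eigvalsh(H).min())
```

The descent drives Tr[H rho(theta)] down toward, and never below, the true ground-state energy; the paper's contribution is doing the gradient estimation on a quantum device with provable sample complexity, which this classical toy cannot capture.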
The structure of the token space for large language models
Robinson, Michael, Dey, Sourya, Sweet, Shauna
Large language models encode the correlational structure present in natural language by fitting segments of utterances (tokens) into a high dimensional ambient latent space upon which the models then operate. We assert that in order to develop a foundational, first-principles understanding of the behavior and limitations of large language models, it is crucial to understand the topological and geometric structure of this token subspace. In this article, we present estimators for the dimension and Ricci scalar curvature of the token subspace, and apply them to three open source large language models of moderate size: GPT2, LLEMMA7B, and MISTRAL7B. In all three models, using these measurements, we find that the token subspace is not a manifold, but is instead a stratified manifold, where on each of the individual strata, the Ricci curvature is significantly negative. We additionally find that the dimension and curvature correlate with generative fluency of the models, which suggests that these findings have implications for model behavior.
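Dimension estimation of a point cloud, the first measurement this line of work applies to token embeddings, is commonly done from ratios of nearest-neighbor distances. Below is a sketch of the Levina-Bickel maximum-likelihood estimator on synthetic data with a known 2-D structure; this is a standard estimator used for illustration, not necessarily the one the paper employs.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "token embeddings": 500 points on a 2-D plane inside a
# 20-D ambient space, so the true intrinsic dimension is 2.
u = rng.normal(size=(500, 2))
basis = np.linalg.qr(rng.normal(size=(20, 2)))[0]
emb = u @ basis.T

def mle_dimension(points, k=10):
    """Levina-Bickel MLE of intrinsic dimension, averaged over points,
    from log-ratios of k-nearest-neighbor distances."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)
    knn = d[:, 1:k + 1]                    # drop the zero distance to self
    inv = np.log(knn[:, -1:] / knn[:, :-1]).mean(axis=1)
    return float(np.mean(1.0 / inv))

dhat = mle_dimension(emb)
```

On this clean synthetic plane the estimate lands near 2 (with the estimator's known slight upward bias at small k); on real token embeddings, local estimates that vary from stratum to stratum are exactly the non-manifold signature the abstract reports.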